This is a final report on research carried out at Technology Service
Corporation under Contract No. A151 F44620-76-C-0069. The length of the
contract was one year. The purpose was to carry out research leading to
more effective methods of analyzing high dimensional data sets. The
motivation for the research came from our growing realization, in handling
many high dimensional data sets gathered in many different fields, that
classical methods were often inappropriate and, when used, led to
misleading results.

Our  main  efforts  over  the  year  were  in  three  areas: 

First:  Extensive  revisions  of  our  work  on  variable  kernel  density  estimates, 

leading  to  a paper  accepted  for  publication  by  Technometrics,  to  appear  in 
their  May  1977  issue. 

The following general comments by the final referee of the paper we
find particularly interesting, as we believe they signify a beginning
acceptance by the statistical community of the need for new methods to
deal with current problems.

REFEREE'S  REPORT 

GENERAL  COMMENTS 

This  is  an  interesting  paper,  describing  innovative  and  useful  research 
on  estimating  multivariate  density  functions  in  a computationally  feasible 
way.  The  mathematical  presentation  is  reasonably  clear,  and  the  simulation 


Secondly: Under a previous AFOSR contract, a novel goodness-of-fit test
devised by Leo Breiman had been tested under numerous simulations. These
led  to  the  conjecture  that  the  test  was  asymptotically  distribution  free. 
The  first  version  of  the  paper  was  submitted  to  JASA  and  rejected  on  the 
grounds  that  the  main  conjecture,  while  made  plausible  by  the  simulations, 
was  not  proven.  Towards  the  end  of  the  contract  period,  Leo  Breiman 
working  in  collaboration  with  Professor  Peter  Bickel,  chairman  of  the 
Statistics  Department  at  U.  C.  Berkeley,  managed  to  find  a proof  that 
established  the  asymptotically  distribution  free  property  of  the  test. 

This is currently being written up for submission to the Annals of
Statistics. We consider this to be a highly significant breakthrough. It
provides the first computationally feasible, consistent and asymptotically
distribution free goodness-of-fit test for dimensions higher than one.

Third: The most exciting research for us over the past year has been the
development of tree-structured classification methods and the growing
realization of their potential in approaching a large variety of problems
that were untouchable by classical methods. The progress in this work was
reported on by Leo Breiman at a joint U. C. Berkeley-Stanford Statistical
Colloquium in October 1976 and generated a good deal of interest. Numerous
requests for written descriptions of the work have been received. However,
to date, we consider our work to be in an exciting but exploratory phase.
We are looking forward to further developments and applications. A write-up
of our progress to date and the directions we want to explore in the future
is contained in an appendix.
In summary, we feel that a significant amount of innovative and useful
research  has  been  accomplished  over  the  contractual  period.  Rather  than 
launch  into  a long  discussion  of  why  the  results  are  important,  we  prefer 
to  let  the  contents  speak  for  themselves. 


APPENDIX  A 


Variable  Kernel  Estimates  of  Multivariate  Densities  and  Their  Calibration 

Leo  Breiman 
William  Meisel 
Edward  Purcell 


1. Introduction and Summary


Given points $x_1, \ldots, x_n$ selected independently from some unknown
underlying density f(x) in M-dimensional Euclidean space, the problem is to
estimate f(x). To date, the most effective general method is the Parzen
approach: select a kernel function $K(x) \ge 0$, with

$$\int K(x)\,dx = 1. \tag{1}$$

Usually K(x) satisfies some additional conditions: unimodality with peak
at x = 0, smoothness, symmetry, finite first and second moments, etc. In
fact, in actual practice, the most frequently used kernel is a Gaussian
density.

Having selected a kernel, the estimate is then given as

$$\hat f(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{\sigma^{M}}\, K\!\left(\frac{x - x_j}{\sigma}\right).$$

As n increases the shape factor $\sigma$ can be decreased, giving greater
resolution for larger sample sizes. The asymptotic mean square consistency
of these estimates is well known [1], and under smoothness conditions on
f(x) asymptotic rates of convergence of the mean squared error can be
derived.
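
As an illustration only, the fixed-width (Parzen) estimate can be sketched in a few lines of Python. The sketch assumes the usual form of the estimate, a sum of kernels of common width $\sigma$ centered on the sample points and scaled by $1/(n\sigma^M)$, with a Gaussian kernel. The function names `gaussian_kernel` and `parzen_estimate` are hypothetical and do not come from the report.

```python
import math


def gaussian_kernel(u):
    """Standard M-variate normal density evaluated at the vector u."""
    m = len(u)
    return math.exp(-0.5 * sum(c * c for c in u)) / (2.0 * math.pi) ** (m / 2.0)


def parzen_estimate(x, samples, sigma):
    """Fixed-width kernel estimate of the density at the point x.

    x       : sequence of M coordinates
    samples : list of n sample points, each a sequence of M coordinates
    sigma   : common shape (width) factor applied to every kernel
    """
    n = len(samples)
    m = len(x)
    total = 0.0
    for xj in samples:
        # Scale the offset from each sample point by the common width.
        u = [(a - b) / sigma for a, b in zip(x, xj)]
        total += gaussian_kernel(u)
    return total / (n * sigma ** m)
```

Because every kernel shares the same width, a single isolated sample point produces a peak of the same sharpness as a point in a crowded region, which is the limitation discussed next.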


However, in terms of practicalities, the situation is far from
satisfactory.


First: It is obvious that a Parzen method of estimation cannot
respond appropriately to variations in the magnitude of f(x). For
instance, if there is a region of low f(x) containing, say, only
one sample point $x_j$, then the estimate will have a peak at $x_j$
and be too low over the rest of the region. In regions where f(x) is
large, the sample points are more densely packed together, and the
Parzen estimate will tend to spread out the high density region.

Thus, the problem is that the peakedness of the kernel is not
data-responsive.

Second: None of the asymptotic results give any generally helpful
leads on how the shape factor $\sigma$ should be selected to give the
"best" estimate of the unknown density. The computed rates of convergence
depend critically on f(x) and its derivatives. Even if one tried to vary
$\sigma$ and got a number of different estimates, the question remains:
which one is "best"?

In  this  paper,  solutions  are  proposed  to  both  of  these  problems. 

First: To make the sharpness of the kernel data-responsive, we use
the class of estimates

$$\hat f(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{(\alpha_k d_{j,k})^{M}}\, K\!\left(\frac{x - x_j}{\alpha_k d_{j,k}}\right),$$

where $d_{j,k}$ is the distance from the point $x_j$ to its kth nearest
neighbor, and $\alpha_k$ is a constant multiplicative factor. The intuitive
concept is clear: In low density regions, $d_{j,k}$ will be large and the
kernel will be spread out. In high density regions, the converse will
occur.
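
A minimal Python sketch of a variable kernel estimate follows, assuming a Gaussian kernel and assuming that each sample point's kernel width is the multiplier times that point's kth nearest neighbor distance. The names `kth_neighbor_distances` and `variable_kernel_estimate` are hypothetical, and the brute-force distance computation is chosen for clarity rather than speed.

```python
import math


def kth_neighbor_distances(samples, k):
    """Distance from each sample point to its kth nearest other sample point."""
    out = []
    for i, xi in enumerate(samples):
        dists = sorted(math.dist(xi, xj) for j, xj in enumerate(samples) if j != i)
        out.append(dists[k - 1])
    return out


def variable_kernel_estimate(x, samples, k, alpha):
    """Kernel estimate at x in which the jth kernel has width alpha * d_{j,k}."""
    n = len(samples)
    m = len(x)
    d_k = kth_neighbor_distances(samples, k)
    total = 0.0
    for xj, dj in zip(samples, d_k):
        h = alpha * dj  # data-responsive width for this sample point
        sq = sum(((a - b) / h) ** 2 for a, b in zip(x, xj))
        total += math.exp(-0.5 * sq) / ((2.0 * math.pi) ** (m / 2.0) * h ** m)
    return total / n
```

Since the neighbor distances do not depend on the multiplier, they could be computed once and reused while the multiplier is varied; the sketch recomputes them on each call only to stay short.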

Second: To select optimizing values of k and $\alpha_k$, a goodness-of-fit
statistic S for multivariate densities proposed in [2] is used in a
procedure that searches for the variable kernel parameters that minimize
S.

The analytics of the variable kernel estimates situation are a bit
difficult to handle, although asymptotic consistency for appropriate
kernels is easily proved under the condition $k/n \to 0$. To get a feeling
for the finite sample situation, and also to get some measure of assurance
that our proposed "solutions" had some value, we ran some extensive
simulations on two underlying data bases; the first was 400 points selected
from a bivariate normal distribution, the second was the bimodal
distribution consisting of a superposition of two bivariate normals, 3/4 of
the bivariate normal used in generating the first data set plus 1/4 of a
normal with a much sharper peak.

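Three measures of error were computed. Define the sample mean and
variance of f(x) by

$$\bar f = \frac{1}{n}\sum_{j=1}^{n} f(x_j), \qquad \sigma_f^{2} = \frac{1}{n}\sum_{j=1}^{n}\bigl(f(x_j) - \bar f\bigr)^{2}.$$

I. Percent of Variance Not Explained (PVNE)

$$\mathrm{PVNE} = \frac{\frac{1}{n}\sum_{j=1}^{n}\bigl(f(x_j) - \hat f(x_j)\bigr)^{2}}{\sigma_f^{2}} \times 100$$

II. Mean Absolute Error, Percent (MAE)

$$\mathrm{MAE} = \frac{\frac{1}{n}\sum_{j=1}^{n}\bigl|\hat f(x_j) - f(x_j)\bigr|}{\bar f} \times 100$$

III. Mean Percent Error (MPE)

$$\mathrm{MPE} = \frac{1}{n}\sum_{j=1}^{n}\frac{\bigl|\hat f(x_j) - f(x_j)\bigr|}{f(x_j)} \times 100$$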
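
The three error measures can be sketched in Python as below. This is an illustration under stated assumptions: the sample variance is taken with divisor n, and the mean absolute error is normalized by the sample mean of the true density, which is one natural reading of "percent" here. The function name `error_measures` is hypothetical.

```python
def error_measures(f_true, f_hat):
    """Return (PVNE, MAE, MPE), each in percent.

    f_true : true density values f(x_j) at the sample points (all positive)
    f_hat  : estimated density values at the same points
    """
    n = len(f_true)
    mean_f = sum(f_true) / n
    # Assumption: variance with divisor n rather than n - 1.
    var_f = sum((f - mean_f) ** 2 for f in f_true) / n
    abs_err = [abs(fh - f) for f, fh in zip(f_true, f_hat)]

    pvne = 100.0 * (sum(e * e for e in abs_err) / n) / var_f
    # Assumption: MAE is expressed relative to the mean true density.
    mae = 100.0 * (sum(abs_err) / n) / mean_f
    mpe = 100.0 * sum(e / f for e, f in zip(abs_err, f_true)) / n
    return pvne, mae, mpe
```

The mean percent error divides each error by the local density, so it weights the tails of the distribution far more heavily than the other two measures.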


A large number of runs were carried out with the two data bases to

(A) Find the best Parzen estimator and the best variable kernel
estimator, using a symmetric Gaussian kernel (naturally the "best" values
of the kernel parameters depend on what measure of error is used).

(B) Compare the performances of the two types of estimators.

(C) See whether the proposed search procedure could accurately
locate the "best fitting" estimates.

Our  conclusions  are: 

i. In all cases the best variable kernel estimate was superior to the
best Parzen estimate. The best Parzen estimator had in both data sets about
twice as much mean percent error (MPE) and percent of variance not explained
(PVNE), and about 50% more mean absolute error than the best variable
kernel estimator.
ii. The S minimization search procedure was successful in locating
the  region  of  parameter  values  where  the  variable  kernel  estimates  gave 
approximately  best  fits  to  the  actual  density. 

The best values of $\sigma$ for the Parzen estimates depended on which
measure of error was used much more than did the best parameter values for
the variable kernel method, and hence the Parzen estimates would be much
more difficult to use in practice (when f is unknown). The S minimization
procedure applied to the Parzen estimates produced values of $\sigma$
that were larger than most of the "best" values and could not be called
successful in this context.

During the course of the study, a number of interesting and useful
properties of variable kernel densities were uncovered. First of all, the
values of k that produced the best fits were surprisingly large, ranging
from 40 in data set II to 100 in data set I (actually the fit was still
improving at k=100). But good fits can be produced over a very wide range
of values of k, as long as $\alpha_k$ satisfies the approximate relation

$$\frac{\alpha_k\,(\bar d_k)^{2}}{\sigma(d_k)} = \text{constant},$$

where $\bar d_k$ is the mean of the kth nearest neighbor distances and
$\sigma(d_k)$ is their standard deviation. Our tentative conclusion is
therefore that actually one needs to find only the single parameter value
$\alpha_k(\bar d_k)^{2}/\sigma(d_k)$ to calibrate the variable kernel
estimates. In our simulation this constant was usually about 3-4 times
larger than the best values of $\sigma$ for the corresponding Parzen
estimate.

The conclusion that the mean percent error is markedly different
between the two types of estimators has important implications for
classification. The method giving the minimum expected misclassification
probability is based on comparing the densities of the different classes.
One common and effective method of getting "good" classification boundaries
has been to estimate the class densities using a set of points that have
already been classified, and compare the estimates to make the
classification decision. Therefore, if this is the intended application,
then the mean percent error is the appropriate error measure since the
tails of the distribution are important, and in this perspective the
variable kernel estimates are decidedly superior to the Parzen estimates.

An important consideration is the variability of the underlying
density. If it is more or less uniformly smooth (as in the first data
base), the adaptive capability of the variable kernel method does not help
as much as in situations where the density is more variable, i.e., has a
number of peaks of different sharpness (as in the second data base).

There  is  a large  body  of  published  literature  regarding  density 
estimation  and  a number  of  good  surveys  are  available  [3],  [4],  [5]. 

The kth nearest neighbor estimator [6] is the only method that is adaptive
to local sample density. If the distance from a point x to its kth nearest
neighbor is d, then the estimate is defined as

$$\hat f(x) = \frac{k/n}{V(d)},$$

where n is the total number of samples, and V(d) is the volume of the
M-dimensional sphere of radius d. The drawback to this type of estimate is
that it is discontinuous and that it does not satisfy (1). The variable
kernel approach offers a combination of the desirable smoothness
properties of the Parzen-type estimators with the data-adaptive character
of the k-nearest neighbor approach.
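
For comparison, the kth nearest neighbor estimate described above can be sketched as follows. This is a minimal illustration; the names `ball_volume` and `knn_density` are hypothetical, and the query point is assumed not to be one of the samples.

```python
import math


def ball_volume(radius, m):
    """Volume of an m-dimensional sphere of the given radius."""
    return math.pi ** (m / 2.0) * radius ** m / math.gamma(m / 2.0 + 1.0)


def knn_density(x, samples, k):
    """kth nearest neighbor density estimate (k/n) / V(d) at the point x."""
    n = len(samples)
    m = len(x)
    # d is the distance from x to its kth nearest sample point.
    d = sorted(math.dist(x, xj) for xj in samples)[k - 1]
    return (k / n) / ball_volume(d, m)
```

The estimate changes abruptly whenever the identity of the kth nearest sample changes, which is the discontinuity the text refers to.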
Furthermore, the variable kernel method carries very little
computational penalty. The distance from a given point to the kth nearest
point is computed only once and stored for all the calibration runs. An
algorithm constructed by Friedman, et al. [7] reduces the finding of all
kth nearest neighbors to n log n time instead of $n^2$.

In  Section  2 we  will  describe  the  simulations  in  more  detail  and 
give  some  tabular  and  graphical  summaries  of  the  results. 

Section 3 will give a brief description of the goodness-of-fit
statistic and give tabular and graphical summaries of its performance.

In Section 4, the behavior of the estimates will be summarized, the
selection of k and $\alpha$ related to the interpoint distance
distribution, and a description given of some early and unsuccessful
efforts at variable kernel estimates.

The variable kernel method has been described in short course notes
on pattern recognition prepared by one of the authors and dating back to
1973. The work in this present study was reported on at the Conference
on the Interface Between Computer Science and Statistics on February 14,
1975 [8]. In June, 1975 we learned that T. J. Wagner had submitted a paper
[9] to the IEEE Trans. Information Theory which is also concerned with
variable kernel estimates. Since his paper is reportedly concerned with
conditions for asymptotic consistency, particularly in one dimension, there
does not seem to be any overlap.
2. The Simulation and Its Results

The  two  data  sets  mentioned  in  the  introduction  were  generated  as 
follows: 

Set I: 400 points selected independently from the density f, a
bivariate normal with mean m = (0,0) and unit covariance matrix.

Set II: 400 points selected independently from the density g, where

$$g = .75\,f + .25\,f_1,$$

where f is as above, and $f_1$ is normal with mean m = (3,3) and
covariance matrix $\Gamma$, chosen to give a much sharper peak than f.

The kernel for both types of estimators was a zero mean bivariate normal
density with unit covariance matrix.

Figure 1 is a graph of the three error measures in data set I
as a function of the shape parameter $\sigma$ of the Parzen estimators.

Figure 2 is a graph of the three error measures for data set I, where
we selected k = 100 and varied the multiplicative parameter $\alpha$.

Figures 3 and 4 are the analogous graphs for data set II, where we
have used k = 40 in the variable kernel graph.

In all cases, we ran the simulations until the minimal values of the
three measures of error were found, both for the Parzen and variable
kernel estimators. For the variable kernel estimators we ran the
simulations for k=10, 20, 30, 40, 50, and 60 in both data sets, and for
k=70, 80, 90, 100 in data set I. Table 1 below summarizes the comparison
between the methods.


To illustrate the resulting fits more visually, we plotted
3-dimensional graphs of the best estimates. For data set I, we used
$\sigma$ = .35 for the Parzen estimator and k = 60, $\alpha$ = .6 for the
variable kernel estimator.

In data set II, the choice of an "optimal" $\sigma$ was more problematical.
We settled on .275 as a reasonable compromise. For the variable kernel we
took k = 40, $\alpha$ = .5. The results are shown in figures 5, 6, 7, and 8
(see end).


                                Data Set I            Data Set II
                             Parzen   Variable     Parzen   Variable
                                      Kernel                Kernel

Minimum Mean
Percent Error                 19.0     10.8         34.7     22.5

Minimum Percent of
Variance Not Explained         6.2      3.6         13.4      6.2

Minimum Mean Absolute
Error, Percent                11.6      8.0         24.2     16.5

Table 1

Fortunately, the variable kernel results were surprisingly insensitive
to the choice of k. Table 2 below gives the minimum values of the measures
of error for the different values of k. Note that in both examples, values
of k over almost the entire range give quite comparable error measurements.

As k varies the fit behaves slightly differently for the two data sets.
For the smooth density of the first example, the error measures are still
decreasing at k=100 and we would probably have gotten slightly better
results by going on to larger k. For the second density the error measures
decrease up to k=40 and then increase at k=50 and 60 (except for the MPE).

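Data Set I

k =                           10     20     30     40    ...    80     90

Minimum Mean
Percent Error               12.9   12.8   12.2   12.1    ...  11.3   10.9

Minimum Percent of
Variance Not Explained       9.3    6.8    6.3    5.9    ...   4.1    4.0

Minimum Mean Absolute
Error, Percent              11.7   11.2   10.7   10.3    ...   8.6    8.5

Data Set II

k =                           10     20     30     40

Minimum Mean
Percent Error               24.5   23.8   23.0   22.6

Minimum Percent of
Variance Not Explained        ..     ..    6.8    6.2

Minimum Mean Absolute
Error, Percent                ..     ..   17.1   16.5

Table 2

(Columns and entries shown as "..." or ".." are not legible in this copy.
Three further legible entries for Data Set II, 17.9 and the pair 17.2, 16.9,
cannot be assigned to a row and column with confidence.)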

While  the  best  fit  for  each  value  of  k in  a wide  range  has  about  the 
same  error  measures,  the  values  of  the  multiplier  a at  which  the  minimum 
errors  occur  vary  considerably  but  systematically  as  k increases.  We  will 
explore  this  further  in  Section  4. 
3. The Goodness-of-Fit Criterion

Since, in practice, the underlying f(x) is not known, the various error
measures cannot be computed. This brings us to the second question posed
in the introduction: How then do we go about selecting $\sigma$, or
$\alpha_k$ and k? (Although we surmise that in actuality we need to
estimate only the optimal single parameter value
$\lambda = \alpha_k(\bar d_k)^{2}/\sigma(d_k)$ in the variable kernel
estimates.)

In [2] a goodness-of-fit criterion for a set of samples to a proposed
density f(x) was developed based on the fact that if f(x) is the true
density, then the variables

$$w_j = e^{-n f(x_j) V(d_{j,1})}, \qquad j = 1, \ldots, n,$$

where $d_{j,1}$ is the distance from $x_j$ to its nearest neighbor and V(r)
is the volume of an M-dimensional sphere of radius r, have a univariate
distribution that is approximately uniform. Thus, the test statistic for an
estimate $\hat f(x)$ is based on the variables

$$\hat w_j = e^{-n \hat f(x_j) V(d_{j,1})}, \qquad j = 1, \ldots, n.$$

Let $\hat w_{(1)} \le \cdots \le \hat w_{(n)}$ be the ordered permutation
of the $\hat w_j$. Then the test statistic S is defined as

$$S = \sum_{j=1}^{n} \Bigl(\hat w_{(j)} - \frac{j}{n}\Bigr)^{2}.$$

One question of great interest to us in this study was whether we could
select "good" values of $\sigma$, or of k and $\alpha_k$, by searching for
a minimum in S.

The results were affirmative (with one exception we will discuss later).
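
The goodness-of-fit statistic can be sketched in Python as below. This is an illustration, not the authors' program: it assumes that each transformed variable is the exponential of minus n times the estimated density at the sample point times the volume of the sphere reaching its nearest neighbor, and that the sorted values are compared with the uniform plotting positions j/n. The names `ball_volume` and `s_statistic` are hypothetical.

```python
import math


def ball_volume(radius, m):
    """Volume of an m-dimensional sphere of the given radius."""
    return math.pi ** (m / 2.0) * radius ** m / math.gamma(m / 2.0 + 1.0)


def s_statistic(samples, f_hat_at_samples):
    """Sum of squared deviations of the ordered w_j from j/n.

    samples          : list of n sample points (sequences of M coordinates)
    f_hat_at_samples : estimated density evaluated at each sample point
    """
    n = len(samples)
    m = len(samples[0])
    w = []
    for i, xi in enumerate(samples):
        # Nearest neighbor distance of the ith sample point.
        d1 = min(math.dist(xi, xj) for j, xj in enumerate(samples) if j != i)
        w.append(math.exp(-n * f_hat_at_samples[i] * ball_volume(d1, m)))
    w.sort()
    # Assumption: the jth ordered value is compared with j/n.
    return sum((wj - (j + 1) / n) ** 2 for j, wj in enumerate(w))
```

In a calibration run one would evaluate a candidate estimate at the sample points, pass those values to `s_statistic`, and keep the parameter values giving the smallest result.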
Naturally, different error measures were generally minimized at different
values of the parameters. In Table 3 we list, for every value of k used,
the value of $\alpha_k$ that minimizes each error measure and the value of
$\alpha_k$ that minimizes S for that value of k.

For the unimodal case the absolute minimum of S occurs at k=100,
$\alpha$=.5. At this point we have

Mean Percent Error = 12.5 (10.8)
Percent of Variance Unexplained = 4.2 (3.6)
Mean Absolute Error, Percent = 8.8 (8.0).

The figures in parentheses are the minimums of the corresponding measures
of error over all ranges and do not occur at a common value of k and
$\alpha$.

In the bimodal case, the minimum of S occurred in the original runs
at k=60, $\alpha$=.4. The values at this point were fairly close to the
minimums, i.e.,

Mean Percent Error = 22.8 (22.5)
Percent of Variance Unexplained = 10.7 (6.2)
Mean Absolute Error, Percent = 18.8 (16.5).


For the Parzen estimator with data set I, the minimizing values of
$\sigma$ for the three error measures above were .40, .35 and .30
respectively. The minimum value of S occurred at .60. For data set II, the
minimums occurred at .400, .175, .225 and the minimum of S at .375. For
Parzen estimators S indicates "optimal" values of $\sigma$ considerably
higher than the values of $\sigma$ that minimize the PVNE and the MAE.
There is also less consistency between the error measures as to the
location of the respective minimizing $\sigma$. The $\sigma$ that minimizes
the mean percent error is


Table 3

Minimizing Values of $\alpha_k$

DATA SET I

k =                     10    20    30    40    50    60    70    80    90   100

Mean Percent
Error (*)              1.4   1.0   0.8   0.7   0.6   0.5   0.5   0.4   0.4

Percent of Variance
Unexplained            1.8   1.2   1.0   0.8   0.7   0.6   0.5   0.5   0.5   0.4

Mean Absolute
Error, Percent (**)    1.5   1.0   0.9   0.7   0.7   0.6   0.5   0.5   0.4   0.4

S                      1.7   1.2   0.9   0.8   0.7   0.7   0.6   0.6   0.5   0.5

(*)  Only nine of the ten entries in this row are legible; they are listed
     in order, and the position of the missing entry is not known.
(**) The entry for k = 10 is given as "1.5 or 1.6".

DATA SET II

k =                     10    20    30    40    50

Mean Percent Error     1.4   0.9   0.7   0.6   0.5

Percent of Variance
Unexplained            1.0   0.6   0.5   0.4   0.3

Mean Absolute
Error, Percent         1.0   0.6   0.5   0.4   0.3

S                      1.1   0.8   0.6   0.5   0.5
the highest and, in the bimodal case, considerably higher than the
other two minimizing values of $\sigma$. Probably this latter fact is due
to the behavior of the Parzen estimates at small values of f(x).

In both data sets, the S estimate of $\sigma$ gives a value of mean percent
error close to the minimum attainable for the data set. This is consistently
true for the variable kernel estimates also. For each value of k, the S
minimizing value of $\alpha_k$ has a mean percent error close to the minimum
possible for that value of k.
4. Mean Interpoint Distance and the Choice of $\alpha$

In our various explorations of the variable kernel estimates, we made
the empirical discovery that over the range of k investigated, for
both data sets

$$\frac{\alpha_k\,(\bar d_k)^{2}}{\sigma(d_k)} = \text{constant},$$

where $\bar d_k$ and $\sigma(d_k)$ are the mean and standard deviation of
the kth nearest neighbor distances for the data set, and $\alpha_k$ is the
"optimal" $\alpha$ for that value of k. To illustrate this, we use as the
"optimal" value of $\alpha_k$ the average of the first three minimizing
values given in Table 3. Table 4 gives the values of
$\alpha_k(\bar d_k)^{2}/\sigma(d_k)$.

The constant decreases about 40% between the two data sets. A similar
decrease occurs for those values of $\sigma$ in the Parzen estimates which
minimize the Mean Absolute Error, Percent and the Percent of Variance Not
Explained. It seems clear that the increase in optimal kernel sharpness
occurs in order to deal with the increased variability in data set II.

At the beginning of this study, we used distances to the closest
neighbor, next closest neighbor, etc., up to the fifth nearest neighbor.
The results were disastrous. On examination, the errors came mainly from
a few points that were too close together. We tried a number of things:

i. Selecting a lower bound D for the interpoint distances and using

$$d'_{j,k} = \max(D,\, d_{j,k})$$

in the kernel estimate in place of $d_{j,k}$. D was selected as a
percentile (usually either the 5th or 10th) of the $d_{j,k}$,
j=1,...,400.
ii. Using a weighted average of the first k nearest neighbor distances.

iii. Selecting a multiplicative constant $\alpha_k$ and using
$\alpha_k d_{j,k}$ in place of $d_{j,k}$.

None of these helped very much as long as we kept working with k small. The
averaging in (ii) was no help. Later we made a theoretical computation
in order to find values $\alpha_1, \ldots, \alpha_k$ with

$$\alpha_i \ge 0, \quad i = 1, \ldots, k, \qquad \sum_{i=1}^{k} \alpha_i = 1,$$

and such that the variance of

$$\sum_{i=1}^{k} \alpha_i\, d_{j,i}$$

is a minimum. Assuming that the density was "locally constant" so that the
distribution of points is "locally Poisson," the answer is

ot*j  * ct 2 — ...  ~ ~ Of  — 1 

This  result  gave  us  some  insight  into  the  failure  of  the  averaging  process. 

In (iii) we found that trying to get more smoothing by increasing
$\alpha_k$ led to serious underestimates of the peaks of the densities.

Nothing  really  helped  until  we  started  exploring  the  larger  values  of 
k and  found  that  (iii)  worked  well  when  k was  large  enough. 

In terms of what has been empirically learned in this study, we
tentatively propose the following method for calibrating a variable kernel
density estimate.

Step 1. Pick an initial k equal to some fraction of the sample size, say
10%, or by plotting $\bar d_k$ versus k and taking a value of k past the
knee of the curve (see figure 9).

Step 2. Do a search for the value of $\alpha_k$ that minimizes S.

Step 3. Using the minimizing value $\alpha_k$, compute

$$\frac{\alpha_k\,(\bar d_k)^{2}}{\sigma(d_k)}.$$

Step 4. Vary k in both directions, selecting $\alpha_k$ so as to hold the
above ratio constant, and search for a k value that minimizes S.

Note that Step 3 may be dimension dependent.
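
Steps 3 and 4 can be sketched in Python as follows. This is a hedged illustration only: it assumes the calibration ratio is the multiplier times the squared mean of the kth nearest neighbor distances divided by their standard deviation, and it uses the population standard deviation (divisor n), a choice the report does not specify. The function names are hypothetical.

```python
import math
import statistics


def kth_neighbor_distances(samples, k):
    """Distance from each sample point to its kth nearest other sample point."""
    out = []
    for i, xi in enumerate(samples):
        dists = sorted(math.dist(xi, xj) for j, xj in enumerate(samples) if j != i)
        out.append(dists[k - 1])
    return out


def calibration_constant(samples, k, alpha_k):
    """Step 3: the ratio alpha_k * mean(d_k)^2 / std(d_k) for a given k."""
    d_k = kth_neighbor_distances(samples, k)
    return alpha_k * statistics.fmean(d_k) ** 2 / statistics.pstdev(d_k)


def alpha_for_k(samples, k, constant):
    """Step 4: the multiplier that holds the ratio at `constant` for a new k."""
    d_k = kth_neighbor_distances(samples, k)
    return constant * statistics.pstdev(d_k) / statistics.fmean(d_k) ** 2
```

With the constant fixed from Step 3, the search in Step 4 becomes one-dimensional: for each candidate k, `alpha_for_k` supplies the multiplier, and only the goodness-of-fit statistic needs to be re-evaluated.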


APPENDIX B


TREE  STRUCTURED  CLASSIFICATION  METHODS 


Background 

Technology  Service  Corporation  has  been  performing  research  under  a 
variety  of  projects  involving  classification  or  categorization. 

For instance, one project has involved the recognition of ship classes
by means of their radar range profiles. In another, algorithms were
developed for the recognition of spoken words using different speakers. A
third project involved the classification of chemical compounds through
their mass spectra.

The nature of these problems is such that classical classification
techniques, such as the use of Fisher discriminants, etc., are virtually
useless. Even recently developed techniques such as the use of nonparametric
density estimates or nearest neighbor methods are largely inapplicable.

The common elements that make these problems different and difficult are:

1. The measurement vector characterizing each object is very
high-dimensional.

2. The number of classes is large.

3. The number of classified samples is small, relative to the
dimensionality and the number of classes.

To do effective decision making in these problems we have turned to
tree structured decision methods. The simplest type of decision tree works
this way: denote the measurement vector attached to an object by x and let
it take values in a space X. A node of the tree corresponds to a subset
$E \subset X$. If $x \in E$ then the decision is made to pass the object to
the left hand node. Otherwise it goes to the right.

The branches of the tree end at terminal nodes. Each terminal node is
assigned to one of the classes. When an object passes down the tree and
hits a terminal node, it is labeled as being in the assigned class.
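
The binary tree just described can be sketched in Python as below. This is a minimal illustration: the `Node` class, the `classify` function, and the particular subsets and class names in `example_tree` are all hypothetical, chosen only to show how an object is passed down the tree.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Node:
    """A tree node: either a split on a subset E of X, or a terminal label."""
    in_subset: Optional[Callable[[Sequence[float]], bool]] = None  # x in E?
    left: Optional["Node"] = None    # followed when x is in E
    right: Optional["Node"] = None   # followed otherwise
    label: Optional[str] = None      # set only on terminal nodes


def classify(node, x):
    """Pass the measurement vector x down the tree to a terminal node."""
    while node.label is None:
        node = node.left if node.in_subset(x) else node.right
    return node.label


# Hypothetical two-level tree over two-dimensional measurement vectors.
example_tree = Node(
    in_subset=lambda x: x[0] < 0.5,
    left=Node(label="class A"),
    right=Node(
        in_subset=lambda x: x[1] < 2.0,
        left=Node(label="class B"),
        right=Node(label="class C"),
    ),
)
```

For example, `classify(example_tree, [0.9, 1.0])` goes right at the root, left at the second node, and returns the label of that terminal node. Note that the second coordinate is consulted only for objects that reach the second node.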

This tree structure is in a way fairly simple. It is binary in the
sense that each node splits into two descendant nodes, and the decision
rules are non-randomized. Still, it is the prototype of tree structured
classification methods, and its successful application in a number of
problems has been the stimulus for our recent and proposed research into
this area.

To understand why decision trees appear to us to hold great promise
in high-dimensional, numerous class problems, think of the simple problem
of constructing a word dictionary containing thousands of entries. In
practice, we use a real dictionary very easily and naturally, without
realizing what an effective tool it is. First, we look at the first letter
of the word. This separates all words into 26 disjoint subsets. Having
located which subset we are in, we now look at the 2nd letter of the word.
This splits the original subset into 26 2nd generation subsets. We continue
until the word is located.


The point is that the decision is not made all at once. The large
amount of information carried by the succession of letters in the word
is not used at one gulp to decide which one of the thousands of possible
cases the word fits into. Instead, a very interesting and practical strategy
is used. Starting with a very limited amount of information, namely, the
first letter of the word, a rough classification into a small (26) number
of groups is done. Then more information (the 2nd letter) is adjoined, and
a finer subdivision is made, and so on.

The point is that the high intrinsic complexity of the problem is
broken down into a sequence of steps in which highly aggregated information
is used to separate a group of objects into a relatively small number of
subgroups.

We have emphasized the above, at the risk of being overly simplistic, in
order to clarify the simple but powerful idea that underlies decision trees.

We know of no other practical method for effectively solving problems
characterized by:

o high-dimensionality
o numerous classes
o small sample size.

By aggregating the information and splitting each group into only a few
subgroups at each stage, one deals with a sequence of problems having
considerably lower dimension and fewer subgroups. Thus, the sample size,
relative to dimensionality and number of subgroups, becomes much larger.


One very important distinction between classical pattern recognition
methods and a decision tree structure is in the way that information is
utilized. In the usual pattern recognition approach, the dimensionality
is made reasonable by the selection of an a priori mapping from
measurement space to "feature" space. That is, depending on certain
physical or heuristic principles, the large amounts of detailed information
regarding any one object are aggregated and summarized in a small number of
variables that comprise the feature vector.

The reason for this mapping is usually very practical: since the usual
pattern recognition algorithms give a one-gulp answer, a drastic reduction
in dimensionality is necessary both to make the sample sufficiently
dense in the space to define the problem and to make it computationally
feasible.

But  having  made  the  reduction  in  dimensionality,  one  is  stuck  with 
it.  The  loss  in  information  is  irrevocable. 

Even  if  the  pattern  recognition  phase  of  the  problem  reveals  that 
additional  information  would  be  useful  in  some  regions  of  the  space,  it  cannot 
be  made  available  without  a reworking  of  the  problem. 

The trouble is that the attempt is made to work the problem in two
separable, non-interacting pieces: one is the feature selection; the
second is the classification.

However, in a tree decision structure, a sequential interaction between
classification and information is possible. As one progresses
down the branches of the tree, more and more detailed information can be
called for by the tree construction algorithm. As a verbal analogy, it is
as though at each node the tree construction could call upstairs and say,
"If you want me to do any more separation on these things, then you've
got to give me some more information."

However, the problem is not resolved simply by resolving to use a
decision tree structure. To a great extent, the problem has been shifted
onto new ground. It now becomes: how does one determine the most
effective, or at least an effective, sequence of decision problems defining
the tree? That is, what question does one ask at each node of the tree?

Put into a statistical context, the problem, simplified to binary trees,
is this: given that the data vector $x$ is drawn with probability $p_i$ from
the distribution $P_i(dx)$ corresponding to the $i$-th class, find a sequence of
binary decision questions of the form "Is $x \in E$?" that leads to a near
maximal probability of correct classification.

In practice, neither the a priori class probabilities $p_i$ nor the class
distributions $P_i(dx)$ are known. Instead, one has on hand a set of objects
and corresponding data vectors whose classification is known. The hub of the
problem is to use these to construct an effective decision tree.

Obviously, by using as many terminal nodes as there are objects in
the learning set, we can usually get perfect classification on the learning
set. Thus, unrestricted tree growing using the learning set alone will
lead to nonsensical results. Restrictions need to be placed on the
complexity of the tree relative to the sample size, and some of the already
classified objects, chosen in some random way, put aside to act as an
evaluation or "test set" for the tree.
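
The hold-out idea in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split_learning_and_test`, the 25 percent test fraction, and the toy data are hypothetical and do not come from the report.

```python
import random

def split_learning_and_test(objects, test_fraction=0.25, seed=0):
    """Randomly set aside a fraction of the classified objects as a test set."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(objects)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    # The first n_test shuffled objects are held out; the rest grow the tree.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: (measurement vector, class label) pairs.
data = [((i, i % 7), i % 3) for i in range(100)]
learning_set, test_set = split_learning_and_test(data)
print(len(learning_set), len(test_set))
```

The tree would be grown on `learning_set` only, and its misclassification rate estimated on `test_set`.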

Trees share the complex character of all sequential decision methods:
a decision made at a point affects all subsequent decisions. Thus, it is
difficult to evaluate what is optimal at each node. With high-dimensional
data vectors, one faces an enormous amount of information, and at each stage
of tree building what is wanted is to use some low-dimensional aggregation
or "averaging" of the information. But what averaging or aggregation is
effective, and how the effectiveness should be measured, are difficult
questions to resolve.

Thus, there are important problems that need further resolution in
order to sharpen the use of decision trees as a practical classification
tool. Technology Service Corporation has developed some methods for tree
growing and applied them, with very promising results, to problems as
diverse as ship classification, with the radar range profile as the
measurement vector, and chemical classification, with the mass spectra as
the measurement vector.

In  the  body  of  this  report,  we  will  outline  our  largely  unpublished 
recent  research  into  decision  tree  construction  and  application.  Then  we 
will  discuss  the  directions  where  we  believe  that  further  research  is  needed. 

The  development  of  decision  tree  methodology  has  important  implications 
for  implementation  in  actual  on-line  recognition  and  classification  systems. 
The  tree  is  developed  off-line  using  the  given  "training  set."  This  is  the 
difficult  and  time-consuming  effort.  But  the  on-line  tree  consists  of  a 
sequence  of  very  simple  questions,  i.e.,  a sequence  of  yes-no  questions  for 
a binary  tree.  Thus,  on-line  classification  can  be  very  rapid.  Furthermore, 
as  the  more  interesting  and  difficult  recognition  problems  move  toward 
higher  dimensions,  with  more  information  being  extracted  concerning  each 
object,  tree  structuring  decision  processes  become  increasingly  appropriate 
and useful.

At the beginning, much of our tree construction was based on heuristics
and trial and error, with the resultant misclassification rate being our
gauge of success. Over the last year, we have been experimenting with
algorithms for the systematic generation of "good" binary trees.

Our best performance to date has been based on the following
algorithm: At any node, suppose there are $n$ objects of the learning set
with associated measurement vectors $x_1, \ldots, x_n$. Suppose that of these
$n$ objects, $n_1$ are in class #1, $n_2$ in class #2, ..., and $n_J$ in class #J.

Suppose that we have defined some family $\mathcal{S}$ of potential splits at
this node. Each split sends a subset of the $n$ objects to one node, and
the complementary subset to the other node. The splits are based on
the values of the measurement vectors $x_i$, $i = 1, \ldots, n$. That is, each
split in $\mathcal{S}$ is based on a question of the form

Is $x \in E$?

Hence, in general, the family $\mathcal{S}$ is constructed by selecting a family
$\{E_s\}$ of subsets of $X$ and looking at the potential splits generated by

Is $x \in E_s$?

At this point, one would like to select the "best" split in $\mathcal{S}$. The
problem is how to define "best".

A split at any node impacts all the nodes below it. Therefore, in
judging how good a split is, one would theoretically have to trace the
subsequent developments of all descendant nodes. A split that does the
best possible job in terms of the two immediate descendant nodes might
not look very good when the tree is followed down for another generation.
Thus, there are levels of judging the goodness-of-split.

Analogously, a novice chess player might, at any given time, choose
the move that most improves his immediate position. But a good chess
player will think two, three or more moves ahead in considering the
implications of the current move.

To date, we have concentrated on finding good criteria for
selecting the "best" split from a family $\mathcal{S}$ using only a one-generation
analysis. This is by no means a trivial matter, because the choice
of a good "strategic" criterion can trade off against a detailed
multi-generation analysis. In other words, using the chess game analogy, if
a player uses really good criteria for judging his improvement in current
position, his criteria will embody a good deal of his past experience
gained from learning the future consequences of current moves.

Suppose there was a split that separated all of the objects in
class j into one node, and the other classes into the other node. For
a problem with a large number of classes, this is not a strategically
sound split. We would rather get a split such that all the objects in
a large subgroup of the classes went into one node and the remaining
classes into the other node. Therefore, we want a criterion for
goodness-of-split that rewards the latter type of split more than the former.

The most satisfactory criterion we have found so far is based on
the uncertainty measure. For any node $N$ having $n_j$ objects in class $j$,
$j = 1, \ldots, J$, define the uncertainty at that node as

$$U(N) = -\sum_j (n_j/n) \log(n_j/n)$$

where $n = \sum_j n_j$. Suppose that a split produces the left and right
nodes $N_L$ and $N_R$, with $n_L$ of the original $n$ objects going left and
$n_R = n - n_L$ going right. Then define the decrease in uncertainty produced
by the split as

$$\Delta U = U(N) - (n_L/n) U(N_L) - (n_R/n) U(N_R)$$
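
As a minimal sketch, the two definitions above translate directly into Python when a node is represented by its list of per-class counts. The function names are hypothetical, and the natural logarithm is an assumption, since the report does not state a base; changing the base only rescales $U$ and $\Delta U$.

```python
import math

def uncertainty(counts):
    """U(N) = -sum_j (n_j/n) log(n_j/n), with empty classes contributing 0."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def decrease_in_uncertainty(left_counts, right_counts):
    """Delta U = U(N) - (n_L/n) U(N_L) - (n_R/n) U(N_R)."""
    parent = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (uncertainty(parent)
            - (n_left / n) * uncertainty(left_counts)
            - (n_right / n) * uncertainty(right_counts))

# Two classes with 10 objects each: a perfect split versus a useless one.
perfect = decrease_in_uncertainty([10, 0], [0, 10])
useless = decrease_in_uncertainty([5, 5], [5, 5])
print(perfect, useless)
```

A split that separates the two classes completely removes all of the parent's uncertainty, while a split that leaves the class mixture unchanged in both descendants removes none.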

This criterion generally rewards the best strategic split. For
instance, if there are $J = 2M$ classes and if a node contains equal numbers
of objects in each class, then the split producing the largest $\Delta U$ places
all objects in $M$ of the classes in one descendant node and the remaining
objects in the other.

The algorithm then searches over all possible splits in $\mathcal{S}$ and selects
the one giving the largest $\Delta U$.
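
The exhaustive search can be illustrated with a small Python sketch. Everything named here (`best_split`, the dictionary of threshold questions, the toy node) is a hypothetical construction for illustration, not the program used in the studies. The toy node has $J = 4$ equally populated classes, so it also checks the claim that an even division of the classes is preferred over peeling off a single class.

```python
import math
from collections import Counter

def uncertainty(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def delta_u(labels, goes_left):
    """Decrease in uncertainty when object i goes left iff goes_left[i]."""
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    n = len(labels)
    u_left = uncertainty(left) if left else 0.0
    u_right = uncertainty(right) if right else 0.0
    return uncertainty(labels) - len(left) / n * u_left - len(right) / n * u_right

def best_split(vectors, labels, questions):
    """Score every question in the family and return the best (name, Delta U)."""
    scored = {name: delta_u(labels, [q(x) for x in vectors])
              for name, q in questions.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Toy node: 4 classes, 5 objects each; the single coordinate equals the class index.
labels = [j for j in range(4) for _ in range(5)]
vectors = [(float(j),) for j in labels]
questions = {
    "x0 <= 0.5": lambda x: x[0] <= 0.5,   # peels off one class
    "x0 <= 1.5": lambda x: x[0] <= 1.5,   # separates two classes from two
}
name, gain = best_split(vectors, labels, questions)
print(name, gain)
```

Here the two-against-two question wins, because it leaves both descendants with only two classes each, whereas the one-against-three question leaves a large, still heavily mixed node.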

The algorithm needs one more piece to be complete. A stopping rule must
be specified. Otherwise, as many terminal nodes as there are objects in the
learning set will be produced. We are currently utilizing the rule: Let $N$ be
the original learning set population and $n$ the node population. Set a
threshold $\alpha$ and declare the node terminal if there is no split in
$\mathcal{S}$ such that

$$\frac{n}{N} \Delta U > \alpha .$$

The adoption of this rule and the threshold value of $\alpha$ used were set by
heuristics. That is, we generated trees that went down to very small
terminal nodes and decided where on the branch it would have been reasonable to
stop. The rule was then constructed to more or less match our statistical
opinions.
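
A sketch of the stopping rule as a single predicate follows. The function name and the numerical values in the example (including the threshold 0.02) are hypothetical; the report does not give the value of $\alpha$ that was used.

```python
def is_terminal(node_size, total_size, best_delta_u, alpha):
    """Terminal when even the best split fails (n/N) * Delta U > alpha."""
    return (node_size / total_size) * best_delta_u <= alpha

# The same best Delta U stops a small node but not a large one.
small_node_stops = is_terminal(node_size=20, total_size=336,
                               best_delta_u=0.3, alpha=0.02)
large_node_stops = is_terminal(node_size=200, total_size=336,
                               best_delta_u=0.3, alpha=0.02)
print(small_node_stops, large_node_stops)
```

The factor $n/N$ means that a given decrease in uncertainty counts for less in a small node, so small nodes are declared terminal sooner.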

The critical element in this tree growing procedure is the selection
of the family $\mathcal{S}$ of potential splits at each node. In ship recognition, the
measurement vector consisted of the intensity of radar returns as measured
every two feet along ships ranging up to 500 feet in length. Thus, the
measurement vector $x$ had a maximum dimensionality of 250. The family
$\mathcal{S}$ for all nodes below the 1st generation was generated by questions
of the form:

"Does the range profile have a local maximum in the interval $[a, b]$?"

The ends of the intervals, $a$ and $b$, ranged along multiples of 1/100 of the
ship's length. Thus $a$ and $b$ were specified by giving two integers $L$, $M$
with $0 \le L < M \le 100$, and consequently the split was specified by $L$ and
$M$. Thus, the family $\mathcal{S}$ contained roughly 5,000 potential splits (one
for each admissible pair $L$, $M$) at each node.
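
The interval questions can be sketched as follows. This is an illustration under assumptions: the helper names, the strict definition of a local maximum, the rounding used to map hundredths of the ship's length onto sample indices, and the inclusive end points $0 \le L < M \le 100$ are all choices made here, not details given in the report.

```python
def has_local_max(profile, a, b):
    """True if the profile has an interior local maximum at an index in [a, b]."""
    for i in range(max(a, 1), min(b, len(profile) - 2) + 1):
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]:
            return True
    return False

def interval_questions(grid=100):
    """All pairs (L, M) with 0 <= L < M <= grid."""
    return [(L, M) for L in range(grid + 1) for M in range(L + 1, grid + 1)]

def ask(profile, L, M, grid=100):
    """Answer the (L, M) question for a profile of any length."""
    last = len(profile) - 1
    return has_local_max(profile, round(L * last / grid), round(M * last / grid))

# Hypothetical range profile with peaks at 20 and 60 percent of the length.
profile = [0, 1, 3, 1, 0, 2, 5, 2, 0, 0, 0]
print(len(interval_questions()), ask(profile, 0, 50), ask(profile, 70, 100))
```

Expressing the interval in fractions of the ship's length lets one question family serve profiles of different lengths.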

In the chemical spectra study, the measurement vectors consisted of
peak intensities on a scale of 0 to 100 corresponding to every integer m/e
value from 1 up to 320. Thus, $x$ was 320-dimensional. The intensities were
divided into five logarithmic ranges, so that the coordinates of $x$ could be
considered as taking values in the set $\{1, 2, 3, 4, 5\}$. The family
$\mathcal{S}$ was generated by all questions of the form:

"Is the intensity at m/e = k greater than m?"

In other words, each question was characterized by the integers $k$ and $m$ with
$1 \le k \le 320$ and $1 \le m \le 4$. Thus, $\mathcal{S}$ contained
$320 \times 4 = 1280$ splits at each node.

New  Concepts 

The results of the above two studies were exciting, in that we could
see that the tree structure gave us a feasible way of solving problems that
had previously seemed quite intractable. The algorithm used, for all of its
crudeness and simplicity, produced reasonable classification results. As we
worked with it, the drawbacks and deficiencies became apparent, and we could
see directions where improvements could produce more powerful and flexible
methods. We have outlined some of these in the sections that follow.

Basically, we want to extend and generalize the realm of possible tree
structures. One direction we very recently came across is the use of
randomized decision rules at the nodes. This leads to structures we have called
probability trees and has the promise of resolving two serious shortcomings
we have found in practical applications.

Another direction where basic work is needed is in information-adaptive
trees. The point here is to allow the class of allowable splits to change
as one progresses down the tree, so that near the top, coarse overall
features are used, and more detailed information is added to discriminate at
the lower nodes. Here, the possibility is also introduced of "looking ahead"
to check the consequences of any given split.

Another possibility is that of using other splitting criteria and
cleansing terminal nodes that have too much of a mixture of classes. Finally,
we discuss our thinking in the direction of the "ultimate" decision tree
classification method, as applied to a truly numerous class problem.


Boundary  Problems,  Confidence  Statements,  and  Probability  Trees 

Analyzing our approach using binary trees and splitting questions
of the form

Is $x \in E$?

we found that there was, at times, an undesirable sensitivity to the boundary.
That is, with many classes to discriminate between, the algorithm would
carefully select the splitting set $E$ to include most objects in the learning
set within a certain set of classes and exclude those without. Often, a
considerable number of measurement vectors would fall near the boundary of $E$.
Then when the test set was run, a frequent source of error was due to
measurement vectors that just missed being on the right side of the boundary
of $E$ at some node and were consequently misclassified.

One  reason  for  this  behavior  of  the  boundaries  was  that  our  learning 
set,  although  large  by  ordinary  one  dimensional  standards  (336  for  ships) 
was  sparsely  spread  out  in  the  high  dimensional  space.  The  boundaries 
could  then  arrange  themselves  to  do  quite  well  on  classifying  the  learning 
set  and  not  too  well  when  another  sparse  set  (the  test  set)  was  randomly 
plunked  down. 

We improved the boundary behavior in ship recognition by taking each
learning set profile, adding random noise, and in this way generating
randomly perturbed profiles from the original profile. But we considered
this an artificial baling-wire-and-glue remedy.

The non-randomized character of the decision rules also gave us another
difficult problem to solve. When an unknown object goes down the decision
tree and ends in a terminal node labelled as class #j, what confidence
do we assign to the classification?

To begin with, the terminal node was labelled as class #j if there
were more class #j objects in it than any other class. Some terminal nodes
have very mixed populations and are terminal because no further
discrimination is possible using the class of splits $\mathcal{S}$ and the given
stopping rule. Other terminal nodes have a very high percentage of their
population in one class. This difference in terminal node distribution is
complicated by the fact that some unknown objects, in traversing the tree,
come down through nodes at which their measurement vectors are very close to
the boundary. Others stay away from the boundary at all nodes.

The problem did not seem to have a really satisfactory formulation in terms
of binary decision trees. Although we conjured up some measures of confidence,
we had little confidence in them.

In July 1976 we started thinking about a different type of tree
structure that may solve both the boundary problem and
the problem of assigning confidence to the classification of an unknown
object.

These new structures are called probability trees and are described
as follows:

Start with a learning set consisting of $n$ objects in $J$ classes,
$C_1, \ldots, C_J$. Number the objects $1, \ldots, n$. A node of the tree
corresponds to a vector $p$ of probabilities $(p_1, \ldots, p_n)$. At each node,
define

$$|p| = \sum_i p_i$$

and the class probabilities $P_j$ by

$$P_j = \frac{1}{|p|} \sum_{i \in C_j} p_i$$

where $\sum_{i \in C_j}$ denotes the sum over all objects in class $j$.

We will set up a procedure for splitting this node. Instead of
considering a set $E$ and asking "Is $x \in E$?", define a family
$\mathcal{F}$ of functions $\phi(x)$ on $X$, such that each $\phi(x)$ satisfies

$$0 \le \phi(x) \le 1 .$$

We will use a particular function $\phi$ as follows: for an object having
measurement vector $x$, put it into the left node with probability $\phi(x)$
and the right node with probability $1 - \phi(x)$. Thus we get the picture

[Figure: a node with two branches, the left taken with probability $\phi(x)$
and the right with probability $1 - \phi(x)$.]

If the $i$-th object has measurement vector $x_i$, and its probability in the
present node is $p_i$, then its probability in the left node is

$$p_i(L) = \phi(x_i) p_i$$

and in the right node is

$$p_i(R) = (1 - \phi(x_i)) p_i .$$

Thus we get the left and right node probability vectors

$$p(L) = (p_1(L), \ldots, p_n(L))$$

$$p(R) = (p_1(R), \ldots, p_n(R))$$

Note that

$$p(L) + p(R) = p .$$

Also, the class probabilities $P_j(L)$ and $P_j(R)$ are defined as before, i.e.,

$$P_j(L) = \frac{1}{|p(L)|} \sum_{i \in C_j} p_i(L)$$

Defining uncertainty in a set of class probabilities as

$$U(P) = -\sum_j P_j \log P_j$$

we have that the decrease in uncertainty due to the node split is

$$\Delta U = U(P) - \frac{|p(L)|}{|p|} U(P(L)) - \frac{|p(R)|}{|p|} U(P(R)) .$$

Choose the $\phi$ in $\mathcal{F}$ that maximizes $\Delta U$ (see remarks later).
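
A minimal Python sketch of a single probabilistic split is given below. The function names, the toy one-dimensional data, and the particular soft function are hypothetical; the natural logarithm is again an assumption. The hard 0/1 function is included to show that the ordinary binary split is recovered as a special case.

```python
import math

def class_probabilities(p, labels, n_classes):
    """P_j = (1/|p|) * sum of p_i over the objects i in class j."""
    total = sum(p)
    P = [0.0] * n_classes
    for p_i, j in zip(p, labels):
        P[j] += p_i / total
    return P

def uncertainty(P):
    return -sum(q * math.log(q) for q in P if q > 0)

def soft_split(p, xs, phi):
    """p_i(L) = phi(x_i) p_i and p_i(R) = (1 - phi(x_i)) p_i."""
    left = [phi(x) * p_i for p_i, x in zip(p, xs)]
    right = [(1.0 - phi(x)) * p_i for p_i, x in zip(p, xs)]
    return left, right

def delta_u(p, xs, labels, n_classes, phi):
    """U(P) - (|p(L)|/|p|) U(P(L)) - (|p(R)|/|p|) U(P(R))."""
    total = sum(p)
    gain = uncertainty(class_probabilities(p, labels, n_classes))
    for child in soft_split(p, xs, phi):
        mass = sum(child)
        if mass > 0:
            child_P = class_probabilities(child, labels, n_classes)
            gain -= (mass / total) * uncertainty(child_P)
    return gain

# Four objects on a line, two classes, all with probability 1 at the root.
xs, labels, p = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0]
hard = lambda x: 1.0 if x < 1.5 else 0.0           # ordinary yes/no question
soft = lambda x: 1.0 / (1.0 + math.exp(x - 1.5))   # blurred boundary at 1.5
hard_gain = delta_u(p, xs, labels, 2, hard)
soft_gain = delta_u(p, xs, labels, 2, soft)
print(hard_gain, soft_gain)
```

On cleanly separated data the hard split gives the larger decrease; the soft split gives up some of that decrease in exchange for not committing objects near the boundary to one side.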

Continuing this way, we arrive at $M$ terminal nodes $T_1, \ldots, T_M$ with
corresponding vectors of probabilities $p^{(1)}, \ldots, p^{(M)}$ and with the
$m$-th node having class probabilities $P_j^{(m)}$, $j = 1, \ldots, J$.

Given a test object, we get our results as follows: traversing the
tree and using the selected function $\phi(x)$ at each node, we end up with
probabilities $q_m$, $m = 1, \ldots, M$, that the object is in terminal node
$T_m$. Define the probability that the object is in class $j$ by

$$P(\text{object is in } C_j) = \sum_m q_m P_j^{(m)} .$$

Identify the object as being in that class having the largest probability.
Then the probability of misclassification is

$$1 - \max_j P(\text{object is in } C_j) .$$
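
The classification step can be sketched as a recursive traversal that carries the probability mass down both branches. The dictionary-based tree layout, the function names, and the two-leaf example tree are hypothetical choices made for this illustration.

```python
def terminal_probabilities(node, x, weight=1.0):
    """Yield (q_m, class probabilities) for each terminal node reached by x."""
    if "class_probs" in node:                 # terminal node T_m
        yield weight, node["class_probs"]
        return
    go_left = node["phi"](x)                  # probability of the left branch
    yield from terminal_probabilities(node["left"], x, weight * go_left)
    yield from terminal_probabilities(node["right"], x, weight * (1.0 - go_left))

def classify(tree, x):
    """Return (class index, class probabilities, misclassification probability)."""
    leaves = list(terminal_probabilities(tree, x))
    n_classes = len(leaves[0][1])
    probs = [sum(q * P[j] for q, P in leaves) for j in range(n_classes)]
    best = max(range(n_classes), key=probs.__getitem__)
    return best, probs, 1.0 - probs[best]

# Hypothetical two-leaf tree over a single coordinate.
tree = {
    "phi": lambda x: 0.8 if x[0] < 1.0 else 0.2,
    "left": {"class_probs": [0.9, 0.1]},
    "right": {"class_probs": [0.2, 0.8]},
}
label, probs, error = classify(tree, (0.5,))
print(label, probs, error)
```

The returned error is the quantity $1 - \max_j P(\text{object is in } C_j)$, so every classification comes with its own confidence statement.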

A few remarks: The standard way of node division, i.e., asking the
question "Is $x \in E$?", can be formulated as: let

$$\phi(x) = \begin{cases} 1, & \text{if } x \in E \\ 0, & \text{otherwise} \end{cases}$$

and send the object left with probability $\phi(x)$, right with probability
$1 - \phi(x)$. Thus, using a $\phi(x)$ subject to $0 \le \phi(x) \le 1$ is a
generalization of the above.


Now, it is impossible and undesirable to maximize $\Delta U$ over too large
a class of functions $\phi(x)$. The reasons are:

1. One does not want to allow too complicated a set of decision rules.

2. If the class of allowable functions $\phi(x)$ is too large, the
maximization search will take too long.

Thus, the essential ingredient for this method to work is the selection of an
appropriate family of functions to maximize $\Delta U$ over. Probably reasonable
requirements are

• All of the functions $\phi$ in $\mathcal{F}$ be reasonably smooth.

• The family $\mathcal{F}$ depends only on a small number of parameters.

Note that, in a sense, a probability tree makes use of fuzzy decision
sets $E$. If $x$ is well within or outside of $E$, it is sent to one or the
other node with probability close to zero or one. But if it is close to the
boundary, the decision becomes blurry.
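
One family that meets both requirements is sketched below. The logistic form is an assumption introduced here purely for illustration; the report does not say which smooth family was considered. Each member depends on three parameters: a coordinate, a threshold, and a width controlling how blurry the boundary is.

```python
import math

def logistic_phi(coordinate, threshold, width):
    """Smooth version of the question 'is x[coordinate] < threshold?'."""
    def phi(x):
        z = (x[coordinate] - threshold) / width
        z = max(-60.0, min(60.0, z))      # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(z))
    return phi

phi = logistic_phi(coordinate=0, threshold=10.0, width=0.5)
well_inside = phi((5.0,))     # far below the threshold: close to one
on_boundary = phi((10.0,))    # exactly on the boundary: one half
well_outside = phi((15.0,))   # far above the threshold: close to zero
print(well_inside, on_boundary, well_outside)
```

As the width shrinks toward zero, this function approaches the 0/1 indicator of the set $E = \{x : x_0 < 10\}$, so the ordinary binary question appears as a limiting case.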

Interestingly enough, although the impetus for probability trees was
to get rid of the boundary problem, as an unforeseen by-product we got a
natural solution to the confidence problem, as expressed by the probability
that the unknown object was misclassified.

Information-Adaptive Trees

A decision  tree  structure  that  uses  the  same  set  of  potential  splits 
at  every  node  has  some  serious  disadvantages.  It  is  generally  using  too 
much  information  during  the  early  part  of  the  tree  construction  and  not 
enough  during  the  later  part. 

For instance, in the chemical spectra data base, we used the
1600-dimensional feature vector consisting of all peaks at m/e locations between
1 and 320 and a separation of the intensities into five ranges. That gave
the 1280 potential splits: Is the intensity (coded) of the peak at m/e = k
greater than j, for $j = 1, 2, 3, 4$ and $k = 1, 2, \ldots, 320$? In general,
only a small fraction of the spectra in the data base had a peak at any given
m/e value whose intensity was greater than j, $j = 1, \ldots, 4$.

Thus,  at  the  early  stages  of  the  tree  we  had  too  much  splintering  with 
small  nodes  being  split  off.  In  the  later  part  of  the  tree  construction, 
the  spectra  at  any  node  have  been  filtered  down  by  their  common  responses 
to  a number  of  questions.  They  exhibit  more  and  more  similarity  the  lower 
down  in  the  tree  the  node  is.  Therefore,  to  effectively  split  a small  node, 
we  may  need  to  include  more  detailed  information  about  the  spectra,  or  to 
ask  more  detailed  questions  about  it. 

In both of our studies, many of the low-confidence small nodes could
not be effectively split by one of the available splitting questions. This
may have been due to the fact that the set of potential splits was not able
to get at the level of detail needed further down the tree.

Another  possible  cause  for  the  program's  difficulty  in  separating 
some  small  nodes  is  its  restriction  to  a single  stage  optimization  proce- 
dure. It  always  selects  the  split  which  gives  the  greatest  decrease  in 
uncertainty.  This  is  a "one-step  optimal  procedure."  But  usually,  the 
succession  of  two  "one-step"  optimal  splits  will  not  be  the  "two-step" 
optimal  split.  Instead  of  judging  the  merit  of  a split  on  how  much  the 
split  reduces  the  uncertainty,  we  could  use  the  following  procedure: 

For each allowable split of the node in question into two nodes, find
the optimal split of each of the two nodes. Now compute the decrease in
uncertainty as we go from the original node to the four "descendant" nodes
and use this decrease to judge the merit of the original split. This gives
a "one-step look-ahead" evaluation of the split. As an analogy, this
corresponds in a chess game to a chess player who is capable of looking ahead
one move into the future, as contrasted to a chess player who plays the
move that yields him the most immediate gain.
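
The look-ahead evaluation just described can be sketched in Python as follows. The function names and the four-object example are hypothetical. The example is chosen so that neither question reduces the uncertainty on its own, while the pair of questions separates the classes completely; the one-step criterion cannot see this, but the look-ahead criterion can.

```python
import math
from collections import Counter

def uncertainty(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def partition(items, question):
    left = [(x, y) for x, y in items if question(x)]
    right = [(x, y) for x, y in items if not question(x)]
    return left, right

def weighted_uncertainty(nodes, n):
    """Sum of (node size / n) * U(node) over the non-empty nodes."""
    return sum(len(node) / n * uncertainty([y for _, y in node])
               for node in nodes if node)

def one_step_gain(items, question):
    parent = uncertainty([y for _, y in items])
    return parent - weighted_uncertainty(partition(items, question), len(items))

def best_child_split(node, questions):
    """Split a descendant by the question leaving the least uncertainty."""
    return min((partition(node, q) for q in questions),
               key=lambda pair: weighted_uncertainty(pair, len(node)))

def look_ahead_gain(items, question, questions):
    """Decrease in uncertainty from the node to its four second-generation nodes."""
    grandchildren = []
    for child in partition(items, question):
        if child:
            grandchildren.extend(best_child_split(child, questions))
    parent = uncertainty([y for _, y in items])
    return parent - weighted_uncertainty(grandchildren, len(items))

# Class is 1 exactly when the two coordinates differ.
items = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
q0 = lambda x: x[0] == 0
q1 = lambda x: x[1] == 0
questions = [q0, q1]
immediate = one_step_gain(items, q0)
with_look_ahead = look_ahead_gain(items, q0, questions)
print(immediate, with_look_ahead)
```

Note that each call to `look_ahead_gain` searches the whole question family twice more, once per descendant, which is the source of the extra cost of look-ahead.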

The  point  is  that,  without  a look-ahead  potential,  especially  in  the 
later  stages,  we  may  be  selecting  "blind  alley"  splits;  that  is,  splits 
that  look  good  as  immediate  prospects,  but  that  do  not  eventually  lead 
to  high-confidence  nodes. 

We believe that a substantial improvement in decision tree
classification procedures can be made by using an information-adaptive structure.
By this we mean a decision tree that

a.  in the early stages, uses less detail, aggregates the
information, and splits the objects into a few large general
categories

b.  in the later stages, adds more detailed information,
examines a larger class of splits, and, if necessary,
goes into a look-ahead mode of operation.

This type of tree structure also lends itself to greater efficiency
in construction. At the early stages, with a large data base, each node
contains many items. With the chemical spectra base, the first few nodes
contained thousands of compounds. If one used the same degree of detail
at this stage that is desirable later, then the potential splits
would number in the thousands. The search through all potential splits
over a large number of items is very time-consuming, and the rewards at
this level are not commensurate with the effort expended.

When  the  nodes  are  small,  then  the  search  over  a larger  number  of 
allowable  splits  is  not  as  burdensome  computationally. 

One way an information-adaptive structure could be implemented is as
follows: A number of different classes $\mathcal{S}_1, \ldots, \mathcal{S}_J$ of
splitting questions will be defined. At the top level is a relatively small
class $\mathcal{S}_1$ of questions that uses highly aggregated information about
the objects. At the bottom level is a large class $\mathcal{S}_J$ of splitting
questions that uses the most detailed information about the object. The tree
growing will proceed as follows:

1.  At any node, depending on its size relative to the original
total population, fix a threshold value $T(s)$, start with
class $\mathcal{S}_k$, where $k$ is determined by $T(s)$, and find the best
split generated by this class.

2.  If the decrease in uncertainty associated with the best
split in $\mathcal{S}_k$ is not greater than the threshold value $T(s)$,
then examine, in a similar way, all the allowable splits
in the next class, $\mathcal{S}_{k+1}$.

3.  Continue in this way until $\mathcal{S}_J$ is exhausted. If no allowable
split exceeds the threshold value, then
i.  either call the node terminal,
or
ii.  go into a one-step look-ahead decision mode.

Which alternative we choose in 3 above will depend on the size and
confidence level of the node in question.
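
The three steps above can be sketched as a loop that escalates from coarse to detailed question classes. The names, the fixed numerical threshold, and the toy data are hypothetical; in particular, how the threshold and the starting class depend on node size is left to the caller, since the report does not specify it.

```python
import math
from collections import Counter

def uncertainty(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gain(items, question):
    """Decrease in uncertainty produced by one yes/no question."""
    n = len(items)
    sides = ([y for x, y in items if question(x)],
             [y for x, y in items if not question(x)])
    parent = uncertainty([y for _, y in items])
    return parent - sum(len(s) / n * uncertainty(s) for s in sides if s)

def choose_split(items, question_classes, threshold, start_level=0):
    """Try each class in turn, coarse to detailed, until a split beats the threshold."""
    for level in range(start_level, len(question_classes)):
        candidates = question_classes[level]
        if not candidates:
            continue
        best = max(candidates, key=lambda q: gain(items, q))
        if gain(items, best) > threshold:
            return level, best
    return None   # caller either declares the node terminal or tries look-ahead

# Toy node: the coarse feature x[0] cannot separate the classes, x[1] can.
items = [((0, 3), "a"), ((0, 7), "b"), ((0, 3), "a"), ((0, 7), "b")]
coarse = [lambda x: x[0] == 0]
detailed = [lambda x: x[1] > 5]
result = choose_split(items, [coarse, detailed], threshold=0.1)
coarse_only = choose_split(items, [coarse], threshold=0.1)
print(result[0], coarse_only)
```

The detailed class is searched only when the coarse class fails, which is where the saving in construction time would come from.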

The usefulness of highly aggregated information at the beginning of
the decision structure is vividly illustrated in the chemical spectra study.
If we had used the set of 320 splitting questions, "Is the molecular weight
equal to k?", $k = 1, 2, \ldots, 320$, none of these splits would have caused an
appreciable decrease in uncertainty. The reason is that only a small
proportion of the compounds have a given molecular weight. The definition
of uncertainty is such that when a node is split and one of the two
descendant nodes has a very small fraction of the parent's population,
then the decrease in uncertainty is small. Thus, if we had used the full
detail available in the molecular weight information, the probability
is that it would have caused very little, if any, alteration in the
tree structure and in the misclassification rate.

However,  aggregating  the  molecular  weight  information  by  dividing 
it  into  the  two  subsets--even  weights  and  odd  weights--and  adding  only 
the  single  question  as  to  which  subset  the  weight  was  in,  led  to  a dif- 
ferent initial  split  and  a drastic  reduction  in  misclassification  rate. 

In our thinking about the chemical spectra problem, we could see the
following possibilities of different levels of aggregation.

The first level of questions could include questions of the form:
Are there two or more main peaks in the m/e sequence k, k+14, k+28, k+42, ...
for k an integer between 0 and 13? Or, more generally, one could ask
questions of the form: Are there j or more main peaks in the subset of m/e
values $\{C_1, C_2, \ldots, C_N\}$?

Notice  that  on  this  level  we  would  be  ignoring  the  intensities  of  the 
peaks  and  using  only  their  locations. 

At the next level of questions, the intensities might be
introduced. For instance, a possible set of questions at this level might be:
Are there j peaks with intensities i in the subset of m/e values
$\{C_1, C_2, \ldots, C_N\}$? Going down the levels, the questions become more
detailed. For instance, we may want to go to the level of questioning
of the form: Is there a peak at m/e $\in C_1$ with intensity $i_1$ and a
peak at m/e $\in C_2$ with intensity $i_2$?

At a certain level down the tree, the isotopic information should be
added to the main peak information. With this added information,
questions concerning the ratio of intensities of main peaks and nearby
isotope peaks may aid in the separation of classes at the nodes.

A one- or multiple-step look-ahead procedure is costly in its initial
construction. For example, if there are $N$ splitting questions at the
level being used, then for a one-step look-ahead optimization, $2N^2$ splits
have to be examined. This number comes from observing that for each
allowable split of the original node, all allowable splits of the two
descendant nodes are examined. If there are, as in our study, about 1000
allowable splits, then a one-step look-ahead optimization would have to examine
$2 \times 10^6$ splits. Hence, the one-step look-ahead procedure would have to be
planned  in  a modified  form  in  order  to  be  feasible. 
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.Other  Splitting  Criteria  and  Iterative  Cleansing 
As  we  mentioned  before,  the  uncertainty  has  the  property  that  if  a 
split  results  in  one  node  much  smaller  than  the  other,  then  there  is  not 
much  decrease  in  uncertainty.  This  may  be  undesirable  in  some  situations. 
For  example,  suppose  there  are  5000  objects  in.  a node,  and  a certain 
split  results  in  a node  containing  100  objects  and  another  with  the  4900 
remaining  objects.  Even  if  the  node  with  100  is  absolutely  pure,  ♦■hat  is, 
consists  entirely  of  one  class,  the  split  will  not  produce  much  of  a 
reduction  in  uncertainty  and  will  almost  surely  not  be  the  optimal  choice. 
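
A small numerical sketch illustrates the point. It assumes that the uncertainty is the Shannon entropy of the class proportions, weighted by node size, and it uses a hypothetical two-class node of 5000 objects split evenly between the classes; neither the class counts nor the comparison split come from the study.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def uncertainty_reduction(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = sum(parent)
    return (entropy(parent)
            - sum(left) / n * entropy(left)
            - sum(right) / n * entropy(right))


# Hypothetical 5000-object node, evenly divided between two classes.
parent = [2500, 2500]

# Split A: a perfectly pure chip of 100 objects and the 4900 remaining objects.
chip = uncertainty_reduction(parent, [100, 0], [2400, 2500])

# Split B: two nodes of 2500 objects, each only 80 percent pure.
balanced = uncertainty_reduction(parent, [2000, 500], [500, 2000])

print(f"pure chip of 100:   {chip:.4f} bits")
print(f"balanced 80% split: {balanced:.4f} bits")
```

Under these assumptions the pure chip lowers the uncertainty by only about 0.02 bits, against about 0.28 bits for the balanced but impure split, so the chip would not be selected.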

We may want to choose our criterion for goodness-of-split so that it
gives a higher weight to splits that produce fairly pure nodes that are
above some minimal size.

For instance, at one extreme, we could adopt this strategy: Find
the split that produces the largest node having more than 90 percent
purity  (i.e.,  such  that  more  than  90  percent  of  its  population  is  in  one 
class).  Having  split  off  this  node,  repeat  the  procedure  on  the  other 
node  until  no  90-percent  pure  node  greater  than  some  minimum  size  can  be 
found.  Now  that  we  have  chipped  away  all  the  above  90-percent  pure  nodes, 
we  lower  our  level  to  80  percent  and  search  for  the  largest  node  having 
at  least  80  percent  purity.  Having  found  an  80-percent  node,  we  then  try 
to  extract  an  above  90  percent  chip  off  of  it  by  splitting. 
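
The first part of this strategy can be sketched as follows. This is a simplified illustration, not the procedure used in the study: items are assumed to be (feature, label) pairs, the allowable splits are assumed to be supplied as yes/no question functions on the feature, and the final step of re-splitting an 80-percent node to extract a 90-percent chip is omitted. All names are hypothetical.

```python
from collections import Counter


def purity(labels):
    """Fraction of a node's population belonging to its largest class."""
    return max(Counter(labels).values()) / len(labels)


def largest_pure_chip(items, questions, level, min_size):
    """Return (chip, rest) for the largest side of any allowable split whose
    purity exceeds `level` and whose size is at least `min_size` (>= 1),
    or None if no such side exists."""
    best = None
    for q in questions:
        yes = [it for it in items if q(it[0])]
        no = [it for it in items if not q(it[0])]
        for chip, rest in ((yes, no), (no, yes)):
            if len(chip) < min_size or not rest:
                continue  # too small, or not a genuine split
            if purity([label for _, label in chip]) > level:
                if best is None or len(chip) > len(best[0]):
                    best = (chip, rest)
    return best


def chip_away(items, questions, levels=(0.9, 0.8), min_size=2):
    """Chip off pure nodes at the highest level first, then lower the level."""
    chips = []
    for level in levels:
        while True:
            found = largest_pure_chip(items, questions, level, min_size)
            if found is None:
                break
            chip, items = found
            chips.append((level, chip))
    return chips, items


# Toy data: one numeric feature, two classes, thresholds as the questions.
toy = list(zip(range(10), "aaaababbbb"))
thresholds = [lambda x, t=t: x < t for t in range(1, 10)]
chips, remainder = chip_away(toy, thresholds)
print([(level, len(chip)) for level, chip in chips], remainder)
```

On the toy data two pure chips of four items each are removed at the 90-percent level, and the two remaining mixed items are left for some other criterion.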

This procedure of chipping away small pure nodes does not seem
desirable in the early stages of tree construction. At this initial phase,
the best thing to do is to get the population roughly sorted into large
subsets such that each contains significant numbers of one or more classes,
but excludes most of the members of one or more of the remaining classes.

At the advanced levels, when we try to split nodes containing only two or
three classes, the "pure chip" strategy itself or in combination with the
uncertainty criterion may produce more accurate classification.

Thus, we may want to adapt our splitting criteria to the depth of
the part of the tree we are working on.

There  are  an  infinity  of  other  criteria  for  judging  goodness-of-split. 
But  in  a sense,  the  "pure  chip"  and  reduction  of  uncertainty  criteria 
stand  at  opposite  ends  of  this  continuum  of  possibilities. 

Even with improved tree-growing procedures, it is inevitable that
some of the terminal nodes will have low confidence levels, where the
confidence level is defined as the percentage of the largest class in the
total population of the node. A simple iterative procedure for cleansing
the low-confidence terminal nodes is to pool the populations of all
terminal nodes with a confidence level below a pre-set level (say, for
instance, 80 percent) and, using this pooled population as the initial
population, rerun the tree construction program. Now that the fairly
pure terminal node populations have been pulled out, the tree structure
for classifying the remaining population should be entirely different
from the original tree.

This iteration can, of course, be repeated over and over, but we
strongly suspect that the point of diminishing returns will appear after
one or two iterations. Still, we anticipate that a substantial
improvement can be produced by this iterative cleansing.
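
The pooling loop can be sketched as below. The tree construction program itself is not reproduced: `grow_tree` is a hypothetical callable, supplied by the caller, that returns the terminal nodes as non-empty lists of (feature, label) pairs, and `halve` is only a toy stand-in used to exercise the loop.

```python
from collections import Counter


def confidence(node):
    """Fraction of the node's population that is in its largest class."""
    counts = Counter(label for _, label in node)
    return max(counts.values()) / len(node)


def iterative_cleansing(items, grow_tree, level=0.8, passes=2):
    """Run `grow_tree` up to `passes` times, each time on the pooled
    population of the terminal nodes whose confidence fell below `level`.
    Returns the accepted terminal nodes and the population still pooled."""
    accepted, pool = [], list(items)
    for _ in range(passes):
        if not pool:
            break
        nodes = grow_tree(pool)
        accepted += [n for n in nodes if confidence(n) >= level]
        pool = [item for n in nodes if confidence(n) < level for item in n]
    return accepted, pool


def halve(items):
    """Toy stand-in for a tree builder: one split at the median feature."""
    ordered = sorted(items)
    if len(ordered) < 2:
        return [ordered]
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]


data = list(zip(range(8), "aaaababb"))
accepted, pool = iterative_cleansing(data, halve, level=0.8, passes=2)
print(len(accepted), pool)
```

In the toy run the first pass accepts one pure node and pools the other; the second pass, run on the pool alone, finds a different split and accepts a second pure node.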

Further Moves Toward a Numerous Class High Dimensional Tree Classifier

The  problem  we  want  to  ultimately  be  able  to  solve  is  something  like 
spoken  word  recognition  with  a large  possible  vocabulary,  say  1000  words. 

We  want  to  be  able  to  recognize  an  individual  word,  no  matter  who  speaks 
it.  Now  attached  to  each  spoken  word  is  the  high  dimensional  measurement 
vector  consisting  of  the  digitized  recording  of  the  word. 

Our planned strategy in attacking this problem will be to first use
highly aggregated information to separate the words into a few subgroups.
For instance: Is it a multisyllable word? Does a peak appear near the
beginning of the word in this frequency range?

The tree structure will be information adaptive: at each node, the
subgroup of word classes present will be broken down into smaller subgroups.

Obviously, this is a sensible approach. The question is how to
implement it. We will describe a general framework and then look at some
specific examples.

Step I: Select a family $\{T_\theta\}$, $\theta \in \Theta$, which maps the measurement
vectors of the learning set into a low dimensional space (feature space).

Step II: Using the values $y = T_\theta(x)$ corresponding to the feature vectors
of the $j$th class, estimate the density $f_j(y)$. If the a priori
probability of class $j$ is $p_j$, then the probability that an unknown
item with feature vector $y^*$ belongs to class $j$ is defined to be

$$P(j \mid y^*) = \frac{p_j f_j(y^*)}{\sum_k p_k f_k(y^*)}$$

Step III. For all items in the $i$th class, let $C_i$ be the set of feature
vectors. Define

$$n_{ij} = \sum_{y \in C_i} P(j \mid y)$$

Step IV. Take the "confusion" matrix $\|n_{ij}\|$ and cluster the classes
into similar groups. Define a measure of misclassification between
groups. After the clustering, let this measure be $M(\theta)$.

Step V. Select $\theta$ to minimize $M(\theta)$.

Step VI. Repeat the above five steps on each of the subgroups of classes
defined in the 4th and 5th steps above, using another family of feature maps.
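
Steps II and III can be illustrated numerically. The sketch below is not the implementation used in this work: it assumes a one-dimensional feature space, substitutes Gaussian class densities with hand-picked parameters for the estimated densities, and uses made-up feature values. The clustering of Step IV and the minimization of Step V are not reproduced.

```python
import math


def gaussian_density(mean, std):
    """Stand-in for an estimated class density f_j (an assumption here)."""
    def f(y):
        z = (y - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return f


def posterior(y, priors, densities):
    """Step II: P(j | y) = p_j f_j(y) / sum_k p_k f_k(y), for every class j."""
    weighted = [p * f(y) for p, f in zip(priors, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]


def confusion_matrix(features_by_class, priors, densities):
    """Step III: n[i][j] is the sum of P(j | y) over the feature values y
    of the items in class i."""
    k = len(priors)
    n = [[0.0] * k for _ in range(k)]
    for i, features in enumerate(features_by_class):
        for y in features:
            for j, p in enumerate(posterior(y, priors, densities)):
                n[i][j] += p
    return n


# Three hypothetical classes: classes 0 and 1 overlap, class 2 is far away.
priors = [1 / 3, 1 / 3, 1 / 3]
densities = [gaussian_density(0.0, 1.0),
             gaussian_density(0.5, 1.0),
             gaussian_density(5.0, 1.0)]
features_by_class = [[-0.2, 0.1, 0.3], [0.4, 0.6, 0.9], [4.8, 5.1, 5.3]]

n = confusion_matrix(features_by_class, priors, densities)
for row in n:
    print(["%.3f" % value for value in row])
```

Each row of the matrix sums to the number of items in that class. In this toy case the large off-diagonal entries between classes 0 and 1 are what would lead a clustering step to place those two classes in the same group.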

This general framework gives rise to a decision tree structure,
although it is not necessarily binary. It can be made binary by clustering
the classes into two groups at each stage, but this is unnecessarily
restrictive. A better alternative than a binary grouping, or a grouping
into any fixed number of dissimilar groups, is to allow the existence of a
group of "dubious" classes. So, for example, a good ternary clustering
would consist of two extremely dissimilar groups of classes and a
"dubious" group between them.

This  approach  is  a combination  of  a number  of  methodologies,  including 
clustering.  One  problem  we  face  in  implementing  an  algorithm  of  the  above 
type  is  to  find  an  effective  way  of  clustering  the  classes  into  subgroups 
at  each  stage.  We  have  developed  one  such  algorithm  which  is  described  in 
Appendix  D. 




Other  Related  Research 

In his extensive 1974 review [1] of pattern recognition methods, Laveen
Kanal states, "For the multiclass case, most of the work has either cast the
M-class problem as M(M-1)/2 two-class problems or employed multidimensional
scatter ratios popular in classical multiple discriminant analysis." Thus,
the use of tree structured decision methods in multiclass problems (M large)
is relatively novel in the area of classification or pattern recognition.
However, there is some scattered work in the literature that is relevant
to our proposed research. One of the earliest works in the field proposing
the use of tree structured classification is due to W. Meisel [2], the
proposed co-principal investigator. Some related but very specialized results
regard "tree grammars." These are referenced in Kanal's work [1].

We are certainly not the first to propose the use of the uncertainty
measure as a splitting criterion, and the idea has appeared sporadically [3,4].
Tree structured decision methods appear very prominently in clustering theory.
Hartigan's recent and excellent survey book [5] contains a full discussion
and description of their use.

One of the most successful uses of tree structured decision methods is
the nonlinear regression program AID and its successors, developed at the
University of Michigan and widely used in the social sciences [4,6]. In
this  algorithm  a search  is  made  for  successive  splits  over  the  independent 
variable  space.  The  splitting  criterion  is  the  reduction  in  mean  square 
error  in  predicting  the  dependent  variable.  The  result  is  a tree  structure 
of  binary  splits  of  the  data  points. 
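
The splitting criterion described here can be sketched for a single independent variable. This is an illustration of the criterion only, not the AID program: it scores each candidate threshold by the reduction in the sum of squared errors about the node means, which for a fixed node is the reduction in mean square error multiplied by the number of points. The data and names are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, reduction) for the binary split of x that most
    reduces the squared error in predicting y by the node mean, or None
    if no split is possible."""
    pairs = sorted(zip(xs, ys))
    total = sum_squared_error([y for _, y in pairs])
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        reduction = total - sum_squared_error(left) - sum_squared_error(right)
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        if best is None or reduction > best[1]:
            best = (threshold, reduction)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, reduction = best_split(xs, ys)
print(threshold, round(reduction, 3))
```

Applying the same search recursively to the two resulting subsets yields a tree of binary splits of the data points.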


Another use of decision tree structures is in file search procedures.
A good account appears in [7].